This tutorial describes the steps required to build a solution that categorizes documents and extracts data from them.
ACME's CRM system handles a variety of document types, such as invoices, resumes, purchase orders, and payroll statements. ACME needs an automated solution that categorizes these documents and extracts the relevant data for each category.
Before you begin, check the Prerequisites page.
First, create a collection of DocumentTemplate objects.
Each template in this collection fully defines one type of document, and the collection is used by both the classification and extraction processes.
XtractFlow document templates selection in C#:

```csharp
static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice);          // adding invoice preset.
    templates.Add(DocumentTemplates.Resume);           // adding resume preset.
    templates.Add(DocumentTemplates.PurchaseOrder);    // adding purchase order preset.
    templates.Add(DocumentTemplates.PayrollStatement); // adding payroll statement preset.
    return templates;
}
```
Next, create a ProcessorComponent object, which the processor requires.
This object encapsulates the logic of the document processing workflow: which steps are enabled and which templates they use.
XtractFlow ProcessorComponent generation in C#:

```csharp
static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true,        // enabling classification.
        EnableFieldsExtraction = true,  // enabling extraction.
        Templates = setupDocumentTemplates()
    };
}
```
Finally, instantiate a DocumentProcessor object and call its Process method to start the inference process.
The method returns a ProcessorResult object that contains the processing outcome: the matched template and the extracted fields.
Using XtractFlow DocumentProcessor in C#:

```csharp
// building the component
ProcessorComponent component = buildComponent();

// processing all documents
foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
{
    ProcessorResult result = new DocumentProcessor().Process(documentFile, component);

    // analyzing results
    if (result.Template != null)
    {
        Console.WriteLine("Document category:" + result.Template.Name);
        if (result.ExtractedFields != null)
        {
            foreach (var item in result.ExtractedFields)
            {
                Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
            }
        }
    }
}
```
Using XtractFlow to achieve classification and data extraction (complete listing):

```csharp
// The XtractFlow namespaces must be imported as well.
using System;
using System.Collections.Generic;
using System.IO;

static void runExtraction()
{
    // registering the license key, the LLM provider, and the resources folder.
    // GDPICTURE_KEY, OPENAI_KEY, and [DIRECTORY_PATH] are placeholders.
    Configuration.RegisterGdPictureKey("GDPICTURE_KEY");
    Configuration.RegisterLLMProvider(new OpenAIProvider(OPENAI_KEY));
    Configuration.ResourcesFolder = "resources";

    // building the component
    ProcessorComponent component = buildComponent();

    // processing all documents
    foreach (string documentFile in Directory.GetFiles([DIRECTORY_PATH]))
    {
        ProcessorResult result = new DocumentProcessor().Process(documentFile, component);

        // analyzing results
        if (result.Template != null)
        {
            Console.WriteLine("Document category:" + result.Template.Name);
            if (result.ExtractedFields != null)
            {
                foreach (var item in result.ExtractedFields)
                {
                    Console.WriteLine($"Field name: '{item.FieldName}' - Field value: '{item.Value}' - Validation state: ({item.ValidationState})");
                }
            }
        }
    }
}

static ProcessorComponent buildComponent()
{
    return new ProcessorComponent()
    {
        EnableClassifier = true,        // enabling classification.
        EnableFieldsExtraction = true,  // enabling extraction.
        Templates = setupDocumentTemplates()
    };
}

static List<DocumentTemplate> setupDocumentTemplates()
{
    List<DocumentTemplate> templates = new List<DocumentTemplate>();
    templates.Add(DocumentTemplates.Invoice);          // adding invoice preset.
    templates.Add(DocumentTemplates.Resume);           // adding resume preset.
    templates.Add(DocumentTemplates.PurchaseOrder);    // adding purchase order preset.
    templates.Add(DocumentTemplates.PayrollStatement); // adding payroll statement preset.
    return templates;
}
```
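The complete listing leaves OPENAI_KEY and [DIRECTORY_PATH] as placeholders. One possible way to supply such values without hard-coding them is to read them from environment variables. The following is a minimal sketch that uses only the .NET base class library; the `Settings` class and the variable names `XTRACTFLOW_OPENAI_KEY` and `XTRACTFLOW_INPUT_DIR` are assumptions made for illustration and are not part of XtractFlow.

```csharp
using System;

static class Settings
{
    // Hypothetical environment variable names chosen for this sketch;
    // XtractFlow itself does not define or read them.
    public const string OpenAiKeyVariable = "XTRACTFLOW_OPENAI_KEY";
    public const string InputDirectoryVariable = "XTRACTFLOW_INPUT_DIR";

    // Returns the value of a required environment variable,
    // or throws if the variable is missing, empty, or whitespace.
    public static string Require(string variableName)
    {
        var value = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }
        return value;
    }
}
```

With a helper like this, `Settings.Require(Settings.OpenAiKeyVariable)` could stand in for OPENAI_KEY and `Settings.Require(Settings.InputDirectoryVariable)` for [DIRECTORY_PATH], so a missing value fails early with a clear message before any document is processed.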